Overview and aims

The goal of this module is to help you learn the basics of R so that you can go away and start to tackle your own data. R is a programming language that first appeared in 1993, with its first stable release (version 1.0.0) in 2000. There are many reasons to use R:

  • it is easy to use and has excellent graphics capabilities
  • it makes many new statistical methods available in a straightforward manner
  • it is open source and free
  • because it is open source and free, it is supported by a large user network
  • because of all of the above, new analysis packages are being released all the time
  • it is also old, but thanks to those new packages you can use newer, more intuitive functions that often do things much faster
  • it can be run on Windows, Linux and Mac

Within research, when most people think of data analysis they think of something like the following: turning data, perhaps curated in a spreadsheet such as Excel, into polished figures that are ready to present.

Figure 1 - Data to figure.




There is now the exciting research discipline of Data Science that allows you to “turn raw data into understanding, insight and knowledge”. Before we start, we need to acknowledge the materials that inspired this course, which we have adapted to make use of helminth-related data. They include the sources cited in the text as R4DS and ModernDive.

These books have been designed for absolute beginners to learn all the basics of R. They are free online and we encourage you to read further in your own time.

The goal of data analysis is to explore, gain insights from, and interpret your data. An analysis cycle usually looks something like this:


Figure 2 - Concept of a Program from (R4DS)




Part 1: Introduction to R

In this module, we are introducing you to the programming language R via RStudio. So what is the difference between the two? You should think of RStudio as a car’s shiny dashboard and R as the engine. “More precisely, R is a programming language that runs computations, while RStudio is an integrated development environment (IDE) that provides an interface by adding many convenient features and tools. So just as the way of having access to a speedometer, rearview mirrors, and a navigation system makes driving much easier, using RStudio’s interface makes using R much easier as well.” (ModernDive).

R can certainly be run independently of RStudio: you can type “R” on the command line (providing it is installed, which it is on your VM) and the R prompt will appear. Sometimes it may in fact be necessary to run R from the command line, especially if you have a large computational task that requires sending jobs to a high-performance computing cluster. However, for the most part, you will find that RStudio is sufficient for what you need. If you want to use RStudio later on your own personal computer, note that R is not bundled with it: you will need to download and install R first, and then RStudio.

Let’s get started by opening RStudio on your VM by double-clicking on the RStudio icon on the VM desktop.

After RStudio opens, you should see something similar to Figure 3.


Figure 3 - RStudio interface to R.




R data structure

To start appreciating the power of R, your data need to be organised, or “tidied”, so that calculations can be performed on them. There are many data types and structures in R. For simplicity we will introduce only two data structures, vector and data frame, and two data types, numeric (numbers) and character (text based).

A vector is the simplest data structure in R. It contains a series of values of the same type. A data frame is a collection of vectors of equal length, arranged as columns. Think of it like an Excel spreadsheet, where each row represents a sample and each column holds one measured variable. To reiterate:

Rows represent samples, e.g., sample A in row 1, sample B in row 2, and so on.

All the values of the same variable must go in the same column. We can have multiple columns with different data types, for example age, sex, or RPKM.

We note here that there are newer and more intuitive ways of working with data frame-type data, and we will use tibbles, the tidyverse version of a data frame, later in the module.

Figure 4 - Two data structures used in this module
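
As a minimal sketch of the two structures described above, the block below builds a numeric vector and a character vector with c(), then combines them into a data frame. The variable names (egg_counts, sample_ids, samples) and all values are made up purely for illustration.

```r
# A numeric vector and a character vector, built with c() (short for "combine").
# All names and values here are hypothetical.
egg_counts <- c(120, 85, 0, 240)
sample_ids <- c("A", "B", "C", "D")

# A data frame: one row per sample, one column per variable
samples <- data.frame(sample = sample_ids, eggs = egg_counts)

class(egg_counts)   # the data type of the first vector
class(sample_ids)   # the data type of the second vector
str(samples)        # a compact summary of the data frame
samples$eggs        # pull one column back out as a vector
```

Each column of the data frame is simply one of the vectors, which is why all values in a column must share a type.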






Functions

Note that in the previous code block we introduced the function c(). A function provides a simplified way to perform a more complex operation on data. We can write our own functions (which is beyond the scope of this tutorial); however, R conveniently contains many built-in functions that are waiting to be discovered! A function is called using round brackets, and can take arguments and options:

function_name(arg1, arg2, arg3, ..., option1 = value1, option2 = value2, ...)
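
To make the general pattern concrete, here is a small sketch using two built-in functions, mean() and round(). The vector of values is made up, and na.rm and digits are examples of options that change how a function behaves.

```r
# Hypothetical measurements, one of which is missing (NA)
values <- c(2.5, 3.7, NA, 4.1)

# With only the argument, a missing value propagates and the result is NA
mean(values)

# The na.rm option tells mean() to drop missing values first
mean(values, na.rm = TRUE)

# Functions can be nested: round the mean to one decimal place
round(mean(values, na.rm = TRUE), digits = 1)
```

Options are written as name = value, so their order does not matter once they are named.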



We will end this first section by plotting some data, building on what you have been taught today.

# Let's create two vectors of 30 numbers each. The first, x, can be built in two different ways.
x <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)

# note, we can do the same using the "rep" function:
x <- rep(1:3, each=10)

# for our second vector, let's generate some random numbers sampled from a normal
# distribution using the "rnorm" function
y <- rnorm(30)


# Let's generate a simple plot using base R graphics functions to visualise the relationship
# between x and y.
# Note that each set of numbers will be paired according to their order in the vector
plot(x, y)

# Let's add some colour using the "col" option
plot(x, y, col="red")

# We can also colour using our data. In our current data, the x variable is categorical, so this
# might be good to set as a colour to distinguish each group.
plot(x, y, col=x)

# We can also change the size and shape of the points using the cex and pch options, respectively
plot(x, y, col=x, pch=20, cex=2)

# Try changing the "pch" value from between 1 and 20 and see what you come up with!


# many functions have additional parameters. Let's try a boxplot. 
boxplot(y ~ x)

#> note we have used a "~" rather than a ",". This indicates we want to show "y" data for 
# each "x" category as one might expect in a box plot. 
#> What happens if you use a "," instead? And why?


# I like hotpink
boxplot(y ~ x, col = c("hotpink", "yellow", "red") )

boxplot(y ~ x, col = c("hotpink", "yellow", "red"), main="My first plot"  )


# we can use the ? sign to find out more about a function
?boxplot


# Try a few things for yourself:
# 1. Above, we used "rnorm" to generate some random numbers sampled from a normal distribution.
#   Let's check visually that this is in fact a normal distribution. Can you set a new variable containing
#   1000 randomly sampled numbers, and visualise this using the "hist" function?







Part 2: Data Wrangling



A large part of any analysis is data wrangling, a data science term for the process of getting your raw data into a format that can be used in R for further analysis and visualisation. Raw data is often very messy and will likely require some manipulation to get it into a suitable format, which may take up a significant amount of the overall analysis time! In the old days, we typically copied and pasted between different Excel spreadsheets. The problem with this approach is that if new data arrives, the whole manual process has to be repeated, which is time consuming and prone to human error. In general, to enable reproducibility, we want to minimise the amount of manual manipulation and instead use commands that allow us to automate these processes.

In this section of the module, you will see that a lot of actions can be performed using existing in-built functions in R.

Figure 5 - Data wrangling (R4DS)


Packages

R consists of a core set of in-built functions which can be supplemented with additional packages. Packages are collections of R functions, data, and compiled code, with a well-defined format that ensures easy installation and a basic standard set of documentation, enabling portability and reliability.

Here, we will use the tidyverse, a collection of packages that share a common design and are highly recommended for manipulating data.
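
As a sketch of how a package is typically made available, the block below shows the two standard steps: installing once, then loading in each session. The install line is commented out on the assumption that the tidyverse is already installed on the course VM.

```r
# Install once per computer. This line is commented out on the assumption
# that the tidyverse is already installed; uncomment it if it is not.
# install.packages("tidyverse")

# Load at the start of every R session
library(tidyverse)
```

Loading the tidyverse attaches several packages in one go, including dplyr, readr and ggplot2, which are used in the rest of this module.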



Data import

In this exercise, we will be using helminth prevalence data downloaded from the Global Atlas of Helminth Infection (GAHI; http://www.thiswormyworld.org). Three files are available to download from the GitHub page.

  • Ascaris_prevalence_data.txt
  • Hookworms prevalence_data.txt
  • Schisoma_mansoni_prevalence_data.txt

They are text files in which each column of data is separated by a tab, i.e., they are tab-delimited. The delimiter tells R where one column ends and the next begins. Note that there are different types of delimiter (commas and spaces are also common), and that they often cause problems when reading data into R if the wrong delimiter is specified or the file is inconsistently formatted. It is worth paying attention to this in your own data and, for ease, sticking to one type.

Let’s get started. First we are going to download the Ascaris_prevalence_data.txt file from GitHub and read it into R using the read_delim() function. We need to tell R that it is tab-delimited. Once it has been read successfully, the data will be stored as a tibble, which allows a lot of nice functions to act on it.
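
The following is a minimal, self-contained sketch of reading a tab-delimited file with read_delim(). Because the exact GitHub address is not reproduced here, the example writes a tiny made-up file to a temporary location first; toy_file, toy and the values in it are assumptions for illustration, and only the column names follow the course data.

```r
library(readr)

# Write a tiny tab-delimited file so the example runs on its own.
# The values are made up; only the column names follow the course data.
toy_file <- tempfile(fileext = ".txt")
writeLines(c("Country\tYear_start\tPrevalence",
             "Angola\t2010\t0.14",
             "Ghana\t2008\t0.02"),
           toy_file)

# delim = "\t" tells read_delim() that columns are separated by tabs.
# read_delim() also accepts a URL in place of a local file path.
toy <- read_delim(toy_file, delim = "\t")

toy             # prints as a tibble
colnames(toy)   # lists the column names
```

For the real data you would pass the address of Ascaris_prevalence_data.txt instead of toy_file and store the result in a variable such as ascaris.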



## # A tibble: 989 x 13
##    Region Country ISO_code ADM1  Latitude Longitude Year_start Year_end
##    <chr>  <chr>   <chr>    <chr>    <dbl>     <dbl>      <dbl>    <dbl>
##  1 Africa Angola  AO       Bengo    -8.59      13.6       2010     2010
##  2 Africa Angola  AO       Bengo    -8.63      13.7       2010     2010
##  3 Africa Angola  AO       Bengo    -8.61      13.6       2010     2010
##  4 Africa Angola  AO       Bengo    -8.62      14.2       2010     2010
##  5 Africa Angola  AO       Bengo    -8.53      13.7       2010     2010
##  6 Africa Angola  AO       Bengo    -8.60      13.6       2010     2010
##  7 Africa Angola  AO       Bengo    -8.63      14.0       2010     2010
##  8 Africa Angola  AO       Bengo    -8.62      13.8       2010     2010
##  9 Africa Angola  AO       Bengo    -8.64      14.0       2010     2010
## 10 Africa Angola  AO       Bengo    -8.60      13.6       2010     2010
## # … with 979 more rows, and 5 more variables: Age_start <dbl>, Age_end <dbl>,
## #   Individuals_surveyed <dbl>, Number_Positives <dbl>, Prevalence <dbl>
##  [1] "Region"               "Country"              "ISO_code"            
##  [4] "ADM1"                 "Latitude"             "Longitude"           
##  [7] "Year_start"           "Year_end"             "Age_start"           
## [10] "Age_end"              "Individuals_surveyed" "Number_Positives"    
## [13] "Prevalence"



Data transformation

In this section, you are going to learn five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:

  • Pick observations by their values (filter()).
  • Reorder the rows (arrange()).
  • Pick variables by their names (select()).
  • Create new variables with functions of existing variables (mutate()).
  • Collapse many values down to a single summary (summarise()).

These can all be used in conjunction with a sixth function - group_by() - which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. Overall, these functions provide the verbs for a language of data manipulation. (R4DS)



Filtering

filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the tibble.

Figure 6 - A schematic diagram of filter()
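
As a hedged illustration of how filter() is called, the sketch below uses a small made-up tibble named toy rather than the full dataset. The column names follow the ascaris data, but the values and the particular conditions are assumptions chosen for the example.

```r
library(dplyr)

# Toy tibble; values are made up, column names follow the ascaris data
toy <- tibble(
  Country    = c("Angola", "Ghana", "Ghana", "Uganda"),
  Year_start = c(2010, 2008, 2008, 2005),
  Prevalence = c(0.14, 0.02, 0.00, 0.89)
)

# Keep the rows for one country
filter(toy, Country == "Ghana")

# Several conditions separated by commas must all be true
filter(toy, Country == "Ghana", Prevalence > 0)

# The same condition written with the "and" operator
filter(toy, Country == "Ghana" & Prevalence > 0)

# Keep everything except one country
filter(toy, Country != "Angola")
```

Note that == tests for equality, whereas a single = would be read as naming an argument.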




## # A tibble: 856 x 13
##    Region Country ISO_code ADM1  Latitude Longitude Year_start Year_end
##    <chr>  <chr>   <chr>    <chr>    <dbl>     <dbl>      <dbl>    <dbl>
##  1 Africa Angola  AO       Bengo    -8.59      13.6       2010     2010
##  2 Africa Angola  AO       Bengo    -8.63      13.7       2010     2010
##  3 Africa Angola  AO       Bengo    -8.61      13.6       2010     2010
##  4 Africa Angola  AO       Bengo    -8.62      14.2       2010     2010
##  5 Africa Angola  AO       Bengo    -8.53      13.7       2010     2010
##  6 Africa Angola  AO       Bengo    -8.60      13.6       2010     2010
##  7 Africa Angola  AO       Bengo    -8.63      14.0       2010     2010
##  8 Africa Angola  AO       Bengo    -8.62      13.8       2010     2010
##  9 Africa Angola  AO       Bengo    -8.64      14.0       2010     2010
## 10 Africa Angola  AO       Bengo    -8.60      13.6       2010     2010
## # … with 846 more rows, and 5 more variables: Age_start <dbl>, Age_end <dbl>,
## #   Individuals_surveyed <dbl>, Number_Positives <dbl>, Prevalence <dbl>
## # A tibble: 856 x 13
##    Region Country ISO_code ADM1  Latitude Longitude Year_start Year_end
##    <chr>  <chr>   <chr>    <chr>    <dbl>     <dbl>      <dbl>    <dbl>
##  1 Africa Angola  AO       Bengo    -8.59      13.6       2010     2010
##  2 Africa Angola  AO       Bengo    -8.63      13.7       2010     2010
##  3 Africa Angola  AO       Bengo    -8.61      13.6       2010     2010
##  4 Africa Angola  AO       Bengo    -8.62      14.2       2010     2010
##  5 Africa Angola  AO       Bengo    -8.53      13.7       2010     2010
##  6 Africa Angola  AO       Bengo    -8.60      13.6       2010     2010
##  7 Africa Angola  AO       Bengo    -8.63      14.0       2010     2010
##  8 Africa Angola  AO       Bengo    -8.62      13.8       2010     2010
##  9 Africa Angola  AO       Bengo    -8.64      14.0       2010     2010
## 10 Africa Angola  AO       Bengo    -8.60      13.6       2010     2010
## # … with 846 more rows, and 5 more variables: Age_start <dbl>, Age_end <dbl>,
## #   Individuals_surveyed <dbl>, Number_Positives <dbl>, Prevalence <dbl>
## # A tibble: 110 x 13
##    Region Country ISO_code ADM1  Latitude Longitude Year_start Year_end
##    <chr>  <chr>   <chr>    <chr>    <dbl>     <dbl>      <dbl>    <dbl>
##  1 Africa Ghana   GH       Asha…     7.30    -1.69        2008     2008
##  2 Africa Ghana   GH       Cent…     5.13    -1.20        2008     2008
##  3 Africa Ghana   GH       Cent…     5.52    -1.33        2008     2008
##  4 Africa Ghana   GH       Cent…     5.58    -1.47        2008     2008
##  5 Africa Ghana   GH       Nort…     9.08    -1.83        2008     2008
##  6 Africa Ghana   GH       Asha…     6.59    -1.86        2008     2008
##  7 Africa Ghana   GH       Uppe…    11.0     -0.484       2008     2008
##  8 Africa Ghana   GH       Asha…     6.58    -1.12        2008     2008
##  9 Africa Ghana   GH       Nort…     8.47    -2.17        2008     2008
## 10 Africa Ghana   GH       Nort…     9.68     0.173       2008     2008
## # … with 100 more rows, and 5 more variables: Age_start <dbl>, Age_end <dbl>,
## #   Individuals_surveyed <dbl>, Number_Positives <dbl>, Prevalence <dbl>
## # A tibble: 951 x 13
##    Region Country ISO_code ADM1  Latitude Longitude Year_start Year_end
##    <chr>  <chr>   <chr>    <chr>    <dbl>     <dbl>      <dbl>    <dbl>
##  1 Africa Burundi BI       Buru…    -3.86      29.5       2007     2007
##  2 Africa Burundi BI       Cibi…    -2.84      29.3       2007     2007
##  3 Africa Burundi BI       Ngozi    -2.84      29.7       2007     2007
##  4 Africa Burundi BI       Buba…    -3.16      29.4       2007     2007
##  5 Africa Burundi BI       Buju…    -3.42      29.4       2007     2007
##  6 Africa Burundi BI       Mwaro    -3.53      29.7       2007     2007
##  7 Africa Burundi BI       Maka…    -4.20      29.7       2007     2007
##  8 Africa Burundi BI       Muyi…    -2.90      30.4       2007     2007
##  9 Africa Burundi BI       Kiru…    -2.53      30.4       2007     2007
## 10 Africa Burundi BI       Cank…    -3.23      30.6       2007     2007
## # … with 941 more rows, and 5 more variables: Age_start <dbl>, Age_end <dbl>,
## #   Individuals_surveyed <dbl>, Number_Positives <dbl>, Prevalence <dbl>
## # A tibble: 951 x 13
##    Region Country ISO_code ADM1  Latitude Longitude Year_start Year_end
##    <chr>  <chr>   <chr>    <chr>    <dbl>     <dbl>      <dbl>    <dbl>
##  1 Africa Burundi BI       Buru…    -3.86      29.5       2007     2007
##  2 Africa Burundi BI       Cibi…    -2.84      29.3       2007     2007
##  3 Africa Burundi BI       Ngozi    -2.84      29.7       2007     2007
##  4 Africa Burundi BI       Buba…    -3.16      29.4       2007     2007
##  5 Africa Burundi BI       Buju…    -3.42      29.4       2007     2007
##  6 Africa Burundi BI       Mwaro    -3.53      29.7       2007     2007
##  7 Africa Burundi BI       Maka…    -4.20      29.7       2007     2007
##  8 Africa Burundi BI       Muyi…    -2.90      30.4       2007     2007
##  9 Africa Burundi BI       Kiru…    -2.53      30.4       2007     2007
## 10 Africa Burundi BI       Cank…    -3.23      30.6       2007     2007
## # … with 941 more rows, and 5 more variables: Age_start <dbl>, Age_end <dbl>,
## #   Individuals_surveyed <dbl>, Number_Positives <dbl>, Prevalence <dbl>



Select

It’s not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

Figure 7 - A schematic diagram of select()
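
A short sketch of select() on a made-up tibble is given below. The tibble toy and its values are hypothetical; the column names are borrowed from the ascaris data.

```r
library(dplyr)

# Toy tibble; values are made up, column names follow the ascaris data
toy <- tibble(
  Region     = c("Africa", "Africa", "East Asia"),
  Country    = c("Angola", "Ghana", "China"),
  Year_start = c(2010, 2008, 2006),
  Prevalence = c(0.14, 0.02, 0.93)
)

select(toy, Country, Prevalence)    # keep two columns by name
select(toy, -Region)                # drop one column
select(toy, Country:Prevalence)     # keep a range of adjacent columns
select(toy, starts_with("Year"))    # helper functions match on the name
```

The rows are untouched in every case; only the set of columns changes.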




## # A tibble: 989 x 2
##    Country Prevalence
##    <chr>        <dbl>
##  1 Angola      0.143 
##  2 Angola      0.167 
##  3 Angola      0.0964
##  4 Angola      0.381 
##  5 Angola      0.0952
##  6 Angola      0.05  
##  7 Angola      0.170 
##  8 Angola      0     
##  9 Angola      0.0635
## 10 Angola      0.677 
## # … with 979 more rows



Mutate

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().

mutate() always adds new columns at the end of your dataset, so we’ll start by creating a narrower dataset so we can see the new variables. Remember that when you’re in RStudio, the easiest way to see all the columns is View().

Figure 8 - A schematic diagram of mutate()
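
As a sketch of mutate(), the block below recomputes prevalence from the two count columns on a made-up tibble. The calculation (positives divided by individuals surveyed) is an assumption about how a column like Prevalence_2 could be derived, not a record of the code used in the course.

```r
library(dplyr)

# Toy tibble; values are made up, column names follow the ascaris data
toy <- tibble(
  Individuals_surveyed = c(100, 50, 200),
  Number_Positives     = c(10, 0, 50)
)

# Add a new column calculated from two existing columns
toy_2 <- mutate(toy, Prevalence_2 = Number_Positives / Individuals_surveyed)

toy_2   # the new column appears at the end
```

The original tibble is not changed; mutate() returns a new tibble, which is why the result is stored in toy_2.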


## # A tibble: 989 x 14
##    Region Country ISO_code ADM1  Latitude Longitude Year_start Year_end
##    <chr>  <chr>   <chr>    <chr>    <dbl>     <dbl>      <dbl>    <dbl>
##  1 Africa Angola  AO       Bengo    -8.59      13.6       2010     2010
##  2 Africa Angola  AO       Bengo    -8.63      13.7       2010     2010
##  3 Africa Angola  AO       Bengo    -8.61      13.6       2010     2010
##  4 Africa Angola  AO       Bengo    -8.62      14.2       2010     2010
##  5 Africa Angola  AO       Bengo    -8.53      13.7       2010     2010
##  6 Africa Angola  AO       Bengo    -8.60      13.6       2010     2010
##  7 Africa Angola  AO       Bengo    -8.63      14.0       2010     2010
##  8 Africa Angola  AO       Bengo    -8.62      13.8       2010     2010
##  9 Africa Angola  AO       Bengo    -8.64      14.0       2010     2010
## 10 Africa Angola  AO       Bengo    -8.60      13.6       2010     2010
## # … with 979 more rows, and 6 more variables: Age_start <dbl>, Age_end <dbl>,
## #   Individuals_surveyed <dbl>, Number_Positives <dbl>, Prevalence <dbl>,
## #   Prevalence_2 <dbl>
## # A tibble: 989 x 2
##    Prevalence Prevalence_2
##         <dbl>        <dbl>
##  1     0.143        0.143 
##  2     0.167        0.167 
##  3     0.0964       0.0964
##  4     0.381        0.381 
##  5     0.0952       0.0952
##  6     0.05         0.05  
##  7     0.170        0.170 
##  8     0            0     
##  9     0.0635       0.0635
## 10     0.677        0.677 
## # … with 979 more rows



Arrange

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

Figure 9 - A schematic diagram of arrange()
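
The sketch below shows arrange() sorting a made-up tibble in three ways, including the tie-breaking behaviour described above. The tibble toy and its values are hypothetical.

```r
library(dplyr)

# Toy tibble; values are made up, column names follow the ascaris data
toy <- tibble(
  Country    = c("Uganda", "Angola", "Ghana", "Angola"),
  Prevalence = c(0.89, 0.14, 0.02, 0.38)
)

arrange(toy, Country)                     # alphabetical by country
arrange(toy, desc(Prevalence))            # highest prevalence first
arrange(toy, Country, desc(Prevalence))   # ties in Country broken by Prevalence
```

desc() reverses the order of a single column, so it can be mixed with ascending columns in the same call.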


## # A tibble: 989 x 13
##    Region Country ISO_code ADM1  Latitude Longitude Year_start Year_end
##    <chr>  <chr>   <chr>    <chr>    <dbl>     <dbl>      <dbl>    <dbl>
##  1 Africa Angola  AO       Bengo    -8.62      13.8       2010     2010
##  2 Africa Angola  AO       Bengo    -8.39      13.8       2010     2010
##  3 Africa Angola  AO       Bengo    -8.53      13.7       2010     2010
##  4 Africa Angola  AO       Bengo    -8.54      13.9       2010     2010
##  5 Africa Angola  AO       Bengo    -8.64      13.8       2010     2010
##  6 Africa Angola  AO       Bengo    -8.63      14.0       2010     2010
##  7 Africa Angola  AO       Bengo    -8.44      13.8       2010     2010
##  8 Africa Burundi BI       Buru…    -3.86      29.5       2007     2007
##  9 Africa Burundi BI       Cibi…    -2.78      29.1       2007     2007
## 10 Africa Burundi BI       Maka…    -4.30      29.6       2007     2007
## # … with 979 more rows, and 5 more variables: Age_start <dbl>, Age_end <dbl>,
## #   Individuals_surveyed <dbl>, Number_Positives <dbl>, Prevalence <dbl>
## # A tibble: 989 x 13
##    Region Country ISO_code ADM1  Latitude Longitude Year_start Year_end
##    <chr>  <chr>   <chr>    <chr>    <dbl>     <dbl>      <dbl>    <dbl>
##  1 East … Philip… PH       Regi…    16.0     120.         2000     2000
##  2 East … China   CN       Yunn…    21.8     100.         2006     2006
##  3 Africa Uganda  UG       Kiso…    -1.34     29.8        2005     2005
##  4 Africa Nigeria NG       Ogun      6.44      4.41       2003     2005
##  5 Africa Nigeria NG       Ogun      7.23      3.53       2003     2005
##  6 Africa Uganda  UG       Kiso…    -1.19     29.7        2005     2005
##  7 East … Philip… PH       <NA>     14.4     121.         2000     2000
##  8 East … Philip… PH       Regi…    16.0     120.         2000     2000
##  9 East … Philip… PH       Regi…    15.3     121.         2000     2000
## 10 Africa Uganda  UG       Kaba…    -1.23     29.9        2006     2006
## # … with 979 more rows, and 5 more variables: Age_start <dbl>, Age_end <dbl>,
## #   Individuals_surveyed <dbl>, Number_Positives <dbl>, Prevalence <dbl>
## # A tibble: 989 x 13
##    Region Country ISO_code ADM1  Latitude Longitude Year_start Year_end
##    <chr>  <chr>   <chr>    <chr>    <dbl>     <dbl>      <dbl>    <dbl>
##  1 Africa Angola  AO       Bengo    -8.62      13.8       2010     2010
##  2 Africa Angola  AO       Bengo    -8.39      13.8       2010     2010
##  3 Africa Angola  AO       Bengo    -8.53      13.7       2010     2010
##  4 Africa Angola  AO       Bengo    -8.54      13.9       2010     2010
##  5 Africa Angola  AO       Bengo    -8.64      13.8       2010     2010
##  6 Africa Angola  AO       Bengo    -8.63      14.0       2010     2010
##  7 Africa Angola  AO       Bengo    -8.44      13.8       2010     2010
##  8 Africa Angola  AO       Bengo    -8.52      13.7       2010     2010
##  9 Africa Angola  AO       Bengo    -8.57      13.5       2010     2010
## 10 Africa Angola  AO       Bengo    -8.39      13.7       2010     2010
## # … with 979 more rows, and 5 more variables: Age_start <dbl>, Age_end <dbl>,
## #   Individuals_surveyed <dbl>, Number_Positives <dbl>, Prevalence <dbl>



Group_by and Summarise

The last key verb is summarise(). It collapses a data frame to a single row.

Together, group_by() and summarise() will allow you to produce grouped summaries, one of the more useful features of dplyr and one you will likely use a lot.

Figure 10 - A schematic diagram of group_by()




Figure 11 - Closer inspection of by_country




summarise() collapses a data frame to a single row, calculated with the function you give it. It becomes much more useful when paired with group_by(): when you use the dplyr verbs on a grouped data frame, they are automatically applied “by group”. For example, if we applied exactly the same code to a data frame grouped by country, we would get the average Prevalence for each country. First, here is the summary for the ungrouped data:

## # A tibble: 1 x 1
##   average.Prevalence
##                <dbl>
## 1              0.102

This one number is the mean Prevalence across all studies, which may not be very informative if we are interested in differences between countries. Hence we do the following:

Figure 12 - A schematic diagram of group_by() and summarise()


## # A tibble: 13 x 2
##    Country       average.Prevalence
##  * <chr>                      <dbl>
##  1 Angola                   0.123  
##  2 Burundi                  0.155  
##  3 Cameroon                 0.592  
##  4 China                    0.926  
##  5 Cote D'Ivoire            0.051  
##  6 Eritrea                  0.00188
##  7 Ghana                    0.0226 
##  8 Malawi                   0.0127 
##  9 Nigeria                  0.443  
## 10 Philippines              0.283  
## 11 Senegal                  0.0942 
## 12 Sierra Leone             0.0710 
## 13 Uganda                   0.0633
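
To show the difference between an ungrouped and a grouped summary, here is a sketch on a made-up tibble. The names by_country and average.Prevalence follow the text and output above; the tibble toy and its values are assumptions.

```r
library(dplyr)

# Toy tibble; values are made up, column names follow the ascaris data
toy <- tibble(
  Country    = c("Angola", "Angola", "Ghana", "Ghana"),
  Prevalence = c(0.10, 0.30, 0.00, 0.04)
)

# Ungrouped: one row summarising the whole tibble
overall <- summarise(toy, average.Prevalence = mean(Prevalence))
overall

# Grouped: one row per country
by_country <- group_by(toy, Country)
per_country <- summarise(by_country, average.Prevalence = mean(Prevalence))
per_country
```

The call to summarise() is identical in both cases; only the grouping of the input differs.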

Grouped summaries, generated using group_by() and summarise(), provide a really useful tool for exploring data when working with dplyr. But before we go any further with this, we need to introduce a powerful new idea: the pipe.



Pipe

Imagine that we want to explore the average prevalence and the number of sites surveyed in each country. Using what you know about dplyr, you might write code like this.

## # A tibble: 13 x 3
##    Country       average.Prevalence Num.sites
##  * <chr>                      <dbl>     <int>
##  1 Angola                   0.123          38
##  2 Burundi                  0.155          22
##  3 Cameroon                 0.592           1
##  4 China                    0.926           1
##  5 Cote D'Ivoire            0.051           1
##  6 Eritrea                  0.00188        40
##  7 Ghana                    0.0226         77
##  8 Malawi                   0.0127         33
##  9 Nigeria                  0.443          20
## 10 Philippines              0.283         132
## 11 Senegal                  0.0942        106
## 12 Sierra Leone             0.0710         52
## 13 Uganda                   0.0633        466
## # A tibble: 13 x 3
##    Country       average.Prevalence Num.sites
##    <chr>                      <dbl>     <int>
##  1 China                    0.926           1
##  2 Cameroon                 0.592           1
##  3 Nigeria                  0.443          20
##  4 Philippines              0.283         132
##  5 Burundi                  0.155          22
##  6 Angola                   0.123          38
##  7 Senegal                  0.0942        106
##  8 Sierra Leone             0.0710         52
##  9 Uganda                   0.0633        466
## 10 Cote D'Ivoire            0.051           1
## 11 Ghana                    0.0226         77
## 12 Malawi                   0.0127         33
## 13 Eritrea                  0.00188        40
## # A tibble: 10 x 3
##    Country      average.Prevalence Num.sites
##    <chr>                     <dbl>     <int>
##  1 Nigeria                 0.443          20
##  2 Philippines             0.283         132
##  3 Burundi                 0.155          22
##  4 Angola                  0.123          38
##  5 Senegal                 0.0942        106
##  6 Sierra Leone            0.0710         52
##  7 Uganda                  0.0633        466
##  8 Ghana                   0.0226         77
##  9 Malawi                  0.0127         33
## 10 Eritrea                 0.00188        40

Notice that we have used four steps to prepare the tibble by_country_summary_sortHighest_removeSomeCountry:

  1. Group by Country to produce a tibble by_country
  2. Summarise to compute average prevalence average.Prevalence, and number of sites Num.sites to produce a tibble by_country_summary
  3. Arrange the tibble so that the highest average prevalence is displayed at the top, producing the tibble by_country_summary_sortHighest
  4. Filter out countries where only one site was investigated, producing the final tibble by_country_summary_sortHighest_removeSomeCountry
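
The four steps can be sketched as follows on a made-up tibble. The intermediate names are taken from the list above; the tibble toy, its values, and the use of n() to count sites per country are assumptions made for this illustration.

```r
library(dplyr)

# Toy tibble; values are made up, column names follow the ascaris data
toy <- tibble(
  Country    = c("Angola", "Angola", "Ghana", "Ghana", "China"),
  Prevalence = c(0.10, 0.30, 0.00, 0.04, 0.93)
)

# 1. Group by country
by_country <- group_by(toy, Country)

# 2. Summarise: mean prevalence and number of rows (sites) per country
by_country_summary <- summarise(by_country,
                                average.Prevalence = mean(Prevalence),
                                Num.sites = n())

# 3. Sort so that the highest average prevalence comes first
by_country_summary_sortHighest <- arrange(by_country_summary,
                                          desc(average.Prevalence))

# 4. Keep only countries with more than one site
by_country_summary_sortHighest_removeSomeCountry <-
  filter(by_country_summary_sortHighest, Num.sites > 1)

by_country_summary_sortHighest_removeSomeCountry
```

Every step creates a new named object, even though only the last one is of interest.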

This code is a little frustrating to write because we have to give each intermediate data frame a name and store it in a variable, even though we don’t care about it. Naming things can be annoying and time consuming, so this slows down our analysis.

There’s another way to tackle the same problem with the pipe, %>%. The same code above can be rewritten as follows:

## # A tibble: 10 x 3
##    Country      average.Prevalence Num.sites
##    <chr>                     <dbl>     <int>
##  1 Nigeria                 0.443          20
##  2 Philippines             0.283         132
##  3 Burundi                 0.155          22
##  4 Angola                  0.123          38
##  5 Senegal                 0.0942        106
##  6 Sierra Leone            0.0710         52
##  7 Uganda                  0.0633        466
##  8 Ghana                   0.0226         77
##  9 Malawi                  0.0127         33
## 10 Eritrea                 0.00188        40


This focuses on the transformations, not what’s being transformed, which makes the code easier to read. You can read it as a series of imperative statements, one per line of code: group_by(), then summarise(), then arrange(), then filter(). As suggested by this reading, a good way to pronounce %>% when reading code is “then”.

Note also that in the code we have used a tab indent on the second and subsequent lines to show that this is a single block of code.
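
A piped version of the group, summarise, arrange and filter sequence might look like the sketch below. It runs on a made-up tibble named toy; the values and the use of n() to count sites are assumptions, while the column names follow the output shown earlier.

```r
library(dplyr)

# Toy tibble; values are made up, column names follow the ascaris data
toy <- tibble(
  Country    = c("Angola", "Angola", "Ghana", "Ghana", "China"),
  Prevalence = c(0.10, 0.30, 0.00, 0.04, 0.93)
)

# Read each %>% as "then": take toy, then group, then summarise, and so on
result <- toy %>%
	group_by(Country) %>%
	summarise(average.Prevalence = mean(Prevalence), Num.sites = n()) %>%
	arrange(desc(average.Prevalence)) %>%
	filter(Num.sites > 1)

result
```

The pipe passes the result of each line as the first argument of the next function, so no intermediate names are needed.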

The point of the pipe is to help you write code in a way that is easier to read and understand. Let’s try one more example.

## # A tibble: 989 x 2
##    Country     Prevalence
##    <chr>            <dbl>
##  1 Philippines      0.952
##  2 China            0.926
##  3 Uganda           0.893
##  4 Nigeria          0.871
##  5 Nigeria          0.847
##  6 Uganda           0.825
##  7 Philippines      0.811
##  8 Philippines      0.771
##  9 Philippines      0.767
## 10 Uganda           0.762
## # … with 979 more rows

Working with the pipe is one of the key criteria for belonging to the tidyverse.




Combining tables

It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.

Relations are always defined between a pair of tables. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair. Sometimes both elements of a pair can be the same table! This is needed if, for example, you have a table of people, and each person has a reference to their parents.

To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data: filtering joins, mutating joins and set operations. Here we are going to demonstrate the bind_rows() function, which stacks the rows of one table on top of another; it is grouped with the set operations in Figure 13 below.

Figure 13 - Functions useful to combine cases/rows
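
As a hedged sketch of bind_rows(), the block below stacks two small made-up tibbles that share the same columns. The names ascaris_toy and hookworm_toy, the values, and the Species label column are all assumptions for illustration.

```r
library(dplyr)

# Two toy tibbles with the same columns; all values are made up
ascaris_toy  <- tibble(Country = c("Angola", "Ghana"), Prevalence = c(0.14, 0.02))
hookworm_toy <- tibble(Country = "Uganda", Prevalence = 0.30)

# Stack the rows. The .id option adds a column recording which table
# each row came from, using the argument names as labels.
combined <- bind_rows(Ascaris = ascaris_toy, Hookworm = hookworm_toy, .id = "Species")

combined
```

Columns are matched by name, so the two tables do not need their columns in the same order.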




Part 3: Visualisation using ggplot2

Data visualisation with ggplot2

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey

We will now use ggplot2 to visualise the data that we’ve been wrangling in the previous sections. ggplot2 is part of the tidyverse, which you’ve already loaded this morning.

“R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places.” R4DS

With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. So ggplot(data = ascaris) creates an empty graph, but it’s not very interesting so we are not going to show it here.

You complete your graph by adding one or more layers to ggplot(). The function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot. We will show a few more examples later.

Let’s look at the ascaris tibble in more detail, using ggplot2 to explore it. Among the variables we have the number of individuals surveyed (Individuals_surveyed) and the number of positive infection cases (Number_Positives).

Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes() - the aesthetics - and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the data argument, in this case, ascaris.

To begin to explore the ascaris data, run this code to put Individuals_surveyed on the x-axis and Number_Positives on the y-axis:
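
A minimal sketch of that call is given here. So that the block runs on its own, it builds a small made-up stand-in called ascaris_toy (a hypothetical name, with illustrative values only); with the course data you would pass ascaris to the data argument instead.

```r
library(ggplot2)

# Made-up stand-in for the course's ascaris tibble (illustrative values only)
ascaris_toy <- data.frame(
  Country              = c("Uganda", "Uganda", "Nigeria", "Nigeria", "Ghana", "Ghana"),
  Individuals_surveyed = c(60, 120, 45, 90, 75, 150),
  Number_Positives     = c(30, 70, 20, 35, 2, 5)
)

# One point per site: individuals surveyed on x, positive cases on y
p <- ggplot(data = ascaris_toy) +
  geom_point(mapping = aes(x = Individuals_surveyed, y = Number_Positives))
p
```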

The plot shows a positive relationship between Individuals_surveyed and Number_Positives. In other words, sites where more individuals were surveyed also tend to report more positive cases. Does this confirm or refute our hypothesis about the number of individuals and positives?

Let’s turn this code into a reusable template for making graphs with ggplot2. To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a collection of mappings.


Figure 14 - A generic usage of ggplot2 from R4DS
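
To make the template concrete, the sketch below fills in its three slots with a dataset, a geom function and a set of mappings. The ascaris_toy table is a made-up stand-in with illustrative values, and geom_col() is chosen here only to show that swapping the geom changes the type of plot.

```r
library(ggplot2)

# Template from R4DS:
#   ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

# Made-up stand-in for the ascaris tibble (illustrative values only)
ascaris_toy <- data.frame(
  Country          = c("Uganda", "Uganda", "Nigeria", "Nigeria", "Ghana", "Ghana"),
  Number_Positives = c(30, 70, 20, 35, 2, 5)
)

# <DATA> = ascaris_toy, <GEOM_FUNCTION> = geom_col,
# <MAPPINGS> = x = Country, y = Number_Positives
p_col <- ggplot(data = ascaris_toy) +
  geom_col(mapping = aes(x = Country, y = Number_Positives))
p_col
```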



Aesthetic mappings

You can add a third variable, like country, to a two dimensional scatterplot by mapping it to an aesthetic. An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing the values of its aesthetic properties.

You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. For example, you can map the colors of your points to the Country variable to reveal the country of each site.
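
A minimal sketch of such a mapping, assuming a Country column like the one shown in the earlier output. The ascaris_toy table is a made-up stand-in with illustrative values; the only change from a plain scatterplot is the extra colour = Country inside aes().

```r
library(ggplot2)

# Made-up stand-in for the ascaris tibble (illustrative values only)
ascaris_toy <- data.frame(
  Country              = c("Uganda", "Uganda", "Nigeria", "Nigeria", "Ghana", "Ghana"),
  Individuals_surveyed = c(60, 120, 45, 90, 75, 150),
  Number_Positives     = c(30, 70, 20, 35, 2, 5)
)

# Mapping colour to Country gives each country its own colour and adds a legend
p_colour <- ggplot(data = ascaris_toy) +
  geom_point(mapping = aes(x = Individuals_surveyed,
                           y = Number_Positives,
                           colour = Country))
p_colour
```

Because colour sits inside aes(), ggplot2 picks the colours and builds the legend for you.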



Facets

One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets, which are subplots that each display one subset of the data.

To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete.
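
As a sketch of faceting by a single discrete variable, the block below splits the scatterplot into one panel per country. The ascaris_toy table is a made-up stand-in with illustrative values.

```r
library(ggplot2)

# Made-up stand-in for the ascaris tibble (illustrative values only)
ascaris_toy <- data.frame(
  Country              = c("Uganda", "Uganda", "Nigeria", "Nigeria", "Ghana", "Ghana"),
  Individuals_surveyed = c(60, 120, 45, 90, 75, 150),
  Number_Positives     = c(30, 70, 20, 35, 2, 5)
)

# facet_wrap() takes a formula: ~ followed by the variable to split on
p_facet <- ggplot(data = ascaris_toy) +
  geom_point(mapping = aes(x = Individuals_surveyed, y = Number_Positives)) +
  facet_wrap(~ Country)
p_facet
```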



Boxplot

From the quick scatterplot, the patterns of parasite prevalence seem to vary between countries. Another useful plot is a boxplot, which can be drawn with the geom_boxplot() function. Here we list the commands that improve the plot bit by bit to generate a polished figure.
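
One possible sequence of improvements is sketched below; it is not necessarily the exact sequence used in the course. The ascaris_toy table is a made-up stand-in with illustrative values, and the labels and theme are arbitrary choices.

```r
library(ggplot2)

# Made-up stand-in for the ascaris tibble: four sites per country
ascaris_toy <- data.frame(
  Country    = rep(c("Uganda", "Nigeria", "Ghana"), each = 4),
  Prevalence = c(0.89, 0.83, 0.76, 0.41,
                 0.87, 0.85, 0.44, 0.20,
                 0.02, 0.05, 0.01, 0.10)
)

# Step 1: the bare boxplot, one box per country
p1 <- ggplot(data = ascaris_toy) +
  geom_boxplot(mapping = aes(x = Country, y = Prevalence))

# Step 2: fill each box with a colour that depends on the country
p2 <- ggplot(data = ascaris_toy) +
  geom_boxplot(mapping = aes(x = Country, y = Prevalence, fill = Country))

# Step 3: add a title and axis labels, use a cleaner theme,
# and drop the legend, which repeats the x-axis
p3 <- p2 +
  labs(title = "Ascaris prevalence by country (toy data)",
       x = "Country", y = "Prevalence") +
  theme_bw() +
  theme(legend.position = "none")
p3
```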



More data wrangling + ggplot

Using what we’ve just learnt, now let’s go through a case study where we do some data wrangling and produce a final plot.
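
The kind of pipeline that produces summaries like the ones printed in this section can be sketched as follows. This is an assumption about the approach, not the course's exact code: worms_toy and all its values are made up, the summary column names are borrowed from the printed output, and the final bar chart is only one plausible choice of plot. The printed output in this section comes from the full course data, not from this toy table.

```r
library(dplyr)
library(ggplot2)

# Made-up combined table: one row per surveyed site (illustrative values only)
worms_toy <- tibble(
  disease              = c("ascaris", "ascaris", "ascaris",
                           "hookworm", "hookworm", "trichuris"),
  Country              = c("Uganda", "Uganda", "Ghana",
                           "Uganda", "Ghana", "Uganda"),
  Individuals_surveyed = c(100, 50, 80, 60, 40, 100),
  Number_Positives     = c(80, 20, 4, 30, 10, 25)
) %>%
  mutate(Prevalence = Number_Positives / Individuals_surveyed)

# Per disease and country: number of sites, totals, and mean site prevalence
per_country <- worms_toy %>%
  group_by(disease, Country) %>%
  summarise(
    sites              = n(),
    total.ind          = sum(Individuals_surveyed),
    total.positives    = sum(Number_Positives),
    average.Prevalence = mean(Prevalence),
    .groups            = "drop_last"  # result stays grouped by disease
  )

# Countries where all three diseases were surveyed
all_three <- per_country %>%
  group_by(Country) %>%
  summarise(num.diseases = n()) %>%
  filter(num.diseases == 3)

# A possible final plot: mean prevalence per country, one bar per disease
p_final <- ggplot(data = per_country) +
  geom_col(mapping = aes(x = Country, y = average.Prevalence, fill = disease),
           position = "dodge")
p_final
```

Note that average.Prevalence here is the mean of the per-site prevalences, which generally differs from total.positives divided by total.ind.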




## # A tibble: 35 x 6
## # Groups:   disease [3]
##    disease Country       sites total.ind total.positives average.Prevalence
##    <chr>   <chr>         <int>     <dbl>           <dbl>              <dbl>
##  1 ascaris Angola           38      1647             263            0.123  
##  2 ascaris Burundi          22      1314             205            0.155  
##  3 ascaris Cameroon          1        76              45            0.592  
##  4 ascaris China             1       215             199            0.926  
##  5 ascaris Cote D'Ivoire     1       118               6            0.051  
##  6 ascaris Eritrea          40      1607               3            0.00188
##  7 ascaris Ghana            77      4518             104            0.0226 
##  8 ascaris Malawi           33      1965              18            0.0127 
##  9 ascaris Nigeria          20      1013             480            0.443  
## 10 ascaris Philippines     132     12847            4005            0.283  
## # … with 25 more rows

## # A tibble: 2 x 2
##   Country      num.diseases
##   <chr>               <int>
## 1 Sierra Leone            3
## 2 Uganda                  3




Summary

In this module, we have shown you:

  • Basics of R
  • importing / exporting data
  • manipulating / tidying data
  • visualising data

But there’s so much more. In addition to R4DS and ModernDive, which were mentioned above, there are extensive resources online that will enable you to tackle more challenging data. Below we list a few more that we find useful. We hope you will go home and start using R more extensively in your daily research tasks.